Synopsis

This project examines economic trends in Major League Baseball. Data sources include the Lahman database (for salaries, performance, and other info), Baseball Reference (for Wins Above Replacement), Forbes (for financials), the US Census Bureau (for US household income), and the Bureau of Labor Statistics (for Consumer Price Index). More details about data sources are on the project website (https://sites.google.com/view/bst260mlm). The data exploration includes trends in player salary, team finances, team performance, and the relationship between salary/spending and player/team performance.

Player Salaries

Note that salary information is only available from 1985 forward.

min(salaries$yearID)
## [1] 1985

We can observe salary trends over time using player salary data from the Lahman dataset, and US household salary data.

Using the respective 1985 salaries as a baseline, we see that US median household income has increased by around 25% every 5 years. Growth in median player salary has generally outpaced the salary growth of US households until the past decade. Average salary, in contrast, has increased at a greater rate since 1990. This suggests that salary trends for “top” players may be different from that of “typical” players. Let’s see if this is true.

Observing the trends by percentile, we can see that, indeed, salaries have not increased at the same rate across the board, with large gains for the top paid players, with little change for players in the lower percentiles. We will now look at a comparison of percentiles by team.

On a team-by-team basis, we can see that salary growth was not the same between the median and the 90th salary percentiles. In particular, there was a decreasing trend in median salary between 1991 and 1995, where the 90th percentile did not change. What may have caused this?

Using the same data to plot the ratio of the 90th percentile salary to the median, we can see an increasing trend until 1995, the year after the baseball strike of 1994, then a decreasing trend afterwards. Again, percentiles were calculated by team.

Is this a real trend? Let’s check with a piecewise linear spline regression (2 methods).

## 
## Call:
## lm(formula = p90rat ~ yearID + yearID * (yearID > 1994.5), data = salteam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4831 -2.6670 -0.7435  1.4938 21.7437 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -1.404e+03  1.629e+02  -8.620   <2e-16 ***
## yearID                      7.081e-01  8.185e-02   8.652   <2e-16 ***
## yearID > 1994.5TRUE         1.528e+03  1.697e+02   9.005   <2e-16 ***
## yearID:yearID > 1994.5TRUE -7.664e-01  8.526e-02  -8.989   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.847 on 914 degrees of freedom
## Multiple R-squared:  0.1634, Adjusted R-squared:  0.1607 
## F-statistic:  59.5 on 3 and 914 DF,  p-value: < 2.2e-16
## 
##  ***Regression Model with Segmented Relationship(s)***
## 
## Call: 
## segmented.lm(obj = splfit2, seg.Z = ~yearID, psi = 1994)
## 
## Estimated Break-Point(s):
##      Est.   St.Err 
## 1994.396    0.723 
## 
## Meaningful coefficients of the linear terms:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.404e+03  1.629e+02  -8.620   <2e-16 ***
## yearID       7.081e-01  8.185e-02   8.652   <2e-16 ***
## U1.yearID   -7.664e-01  8.526e-02  -8.989       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.847 on 914 degrees of freedom
## Multiple R-Squared: 0.1634,  Adjusted R-squared: 0.1607 
## 
## Convergence attained in 2 iterations with relative change 0

Looks like there are different trends before and after the strike. Modeling a discontinuous regression also illustrates this break.

## 
## Call:
## RDestimate(formula = p90rat ~ yearID, data = salteam, cutpoint = 1994.5)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value
## LATE        6.049     334           2.872     1.0894      2.636  
## Half-BW     3.024     166           4.335     1.6445      2.636  
## Double-BW  12.097     618           1.708     0.7535      2.266  
##            Pr(>|z|)    
## LATE       0.008394  **
## Half-BW    0.008395  **
## Double-BW  0.023436  * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       34.93  3         330         0.000e+00
## Half-BW    11.26  3         162         1.901e-06
## Double-BW  74.99  3         614         0.000e+00

Team Revenues

Following the strike of 1994, a two-pronged plan was put in place to limit the effect of reduced revenues. First was to enact revenue-sharing in order to keep smaller teams from struggling financially. The other was a luxury tax, which was a way to limit the spending of the big teams without explicitly enforcing a salary cap.

Let’s look at how team/franchise revenue has changed over time. Data comes from Michael Ozanian, a writer for Forbes, who has been compiling annual financial data for Major League Baseball since 1990. Numbers have been adjusted to 1990 dollars using the Consumer Price Index (CPI).

We see that there is a drop in revenues in the year of the stike, but this recovers after a few years. Note that the 2012 revenue data is incomplete and skewed upwards. What about franchise values?

We can see that the strike did not drastically affect franchise values. Soon after the strike, as revenues recovered, valuations rose greatly after the revenue sharing agreement, and again after 2010 (note that the 2012 valuation data is complete). Let’s look back at revenues. How did revenues for each team change from year to year?

We can see that teams mostly have increasing revenues year-on-year, with the biggest increases in 1997 and 2014. The other goal of the revenue sharing agreement was to prevent losses. Was this accomplished? We will look at this information over time, stratifying by Collective Bargaining Agreement (CBA), the contract between team owners that sets the financial rules of the league.

We see that the percentage of teams suffering either an operating loss or loss in franchiase value has decreased over time, particularly following the enactment of CBA10. Is this a real trend? Let’s check.

## 
## Call:
## lm(formula = teams ~ Year, data = oloss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28988 -0.10506  0.01310  0.05647  0.29085 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.710471   7.714400   3.851 0.000767 ***
## Year        -0.014688   0.003852  -3.813 0.000845 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1473 on 24 degrees of freedom
## Multiple R-squared:  0.3772, Adjusted R-squared:  0.3513 
## F-statistic: 14.54 on 1 and 24 DF,  p-value: 0.0008448
## 
## Call:
## lm(formula = teams ~ Year, data = deval)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.279949 -0.084264 -0.006249  0.103421  0.230952 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.617231   8.244450   3.956 0.000627 ***
## Year        -0.016183   0.004116  -3.932 0.000666 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1484 on 23 degrees of freedom
## Multiple R-squared:  0.402,  Adjusted R-squared:  0.376 
## F-statistic: 15.46 on 1 and 23 DF,  p-value: 0.0006664

The linear models suggest a decreasing trend, with the percentage of teams suffering operating/valuation loss decreasing by ~1.5% every year, for a ~30 percentage point decrease over 20 years (not using compounding). Let’s look at differences between the different CBA periods.

##                Df Sum Sq Mean Sq F value Pr(>F)   
## as.factor(cba)  4 0.3586 0.08966   6.016 0.0024 **
## Residuals      20 0.2981 0.01490                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = teams ~ as.factor(cba), data = oloss)
## 
## $`as.factor(cba)`
##              diff        lwr         upr     p adj
## 8-7    0.04963370 -0.1740707  0.27333810 0.9618847
## 9-7   -0.09322344 -0.3515350  0.16508815 0.8145533
## 10-7  -0.27655678 -0.5348684 -0.01824518 0.0322181
## 11-7  -0.17322344 -0.4182793  0.07183245 0.2523730
## 9-8   -0.14285714 -0.3665615  0.08084726 0.3437435
## 10-8  -0.32619048 -0.5498949 -0.10248607 0.0024846
## 11-8  -0.22285714 -0.4311146 -0.01459968 0.0323258
## 10-9  -0.18333333 -0.4416449  0.07497826 0.2489516
## 11-9  -0.08000000 -0.3250559  0.16505589 0.8624128
## 11-10  0.10333333 -0.1417226  0.34838923 0.7164209
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## as.factor(cba)  4 0.4607 0.11519    6.28 0.00214 **
## Residuals      19 0.3485 0.01834                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = teams ~ as.factor(cba), data = deval)
## 
## $`as.factor(cba)`
##              diff        lwr         upr     p adj
## 8-7   -0.31607143 -0.5917918 -0.04035104 0.0200669
## 9-7   -0.40714286 -0.7181974 -0.09608831 0.0069843
## 10-7  -0.34047619 -0.6515307 -0.02942164 0.0278618
## 11-7  -0.47714286 -0.7745679 -0.17971782 0.0009916
## 9-8   -0.09107143 -0.3404699  0.15832705 0.8053502
## 10-8  -0.02440476 -0.2738032  0.22499371 0.9982114
## 11-8  -0.16107143 -0.3932488  0.07110592 0.2660313
## 10-9   0.06666667 -0.2213139  0.35464722 0.9548525
## 11-9  -0.07000000 -0.3432023  0.20320234 0.9360068
## 11-10 -0.13666667 -0.4098690  0.13653568 0.5721522

The Tukey multiple comparisons show that valuation losses were highest during CBA7 and significantly lower for all periods thereafter. The period differences in operating losses are less clear, although CBA10 is significantly lower than CBA7 and CBA8.

sources:

https://www.sbnation.com/2010/8/30/1065675/8-30-2002-baseball-avoids-another https://www.fangraphs.com/tht/a-history-of-the-collective-bargaining-agreement-part-3/

The analysis suggests that the measures taken in successive CBAs since the 1994 strike have improved the financial sitation of the league. Did this affect spending on players?

We can see that the standardized distribution of payrolls across the league remained fairly consistent until after 2001, after which there was one team (the Yankees, obviously) with a much higher spend for several years.

Looking at percentage of revenue spent, it appears that player payrolls have been around 50% of franchise revenue for the past 20 years, with the exception of the strike year and year immediately following, where revenues took a hit.

Performance

As a measure of player performance, we will use Wins Above Replacement (WAR). The calculation is very involved and not standardized, so the WAR numbers were taken from the Baseball Refernce database (not the same as the Lahman Database).

First, let’s look at how WAR distribution across the league has changed over time.

We can see that WAR variance decreases over time, and stays low after 1996. This could be due to the revenue-sharing agreements, or perhaps due to teams being better about picking players. Regardless, it would seem that the post-strike era is more competitive than before the strike. Let’s explore this further.

First, let’s look at median WAR by team, sorted by summed WAR prior to the year 2000.

From the heat map, it seems that the teams that consistently had better players before the strike (marked by the dotted white line) lost that advantage after the strike. Let’s look at the better players (75th percentile) in each team.

The change is less evident here. Let’s see if this affected team performance.

There is no disernable pattern here. No team is overly dominant in the league, either before or after the strike. Case in point: Houston, which above is visibly the worst team between 2010-2015, just won the World Series.

What kind of players were paid more? Let’s look at the year 1995.

The above visualization sounded like a better idea than it turned out to be. Let’s look at a table instead.

##                 Name      Team Salary
## 1      Cecil Fielder    Tigers   $9.2
## 2        Barry Bonds    Giants   $8.2
## 3         David Cone Blue Jays     $8
## 4        Ken Griffey  Mariners   $7.6
## 5       Frank Thomas White Sox   $7.2
## 6       Jeff Bagwell    Astros   $6.9
## 7       Mark McGwire Athletics   $6.9
## 8         Cal Ripken   Orioles   $6.7
## 9        Greg Maddux    Braves   $6.5
## 10     Kirby Puckett     Twins   $6.3
## 11     Lenny Dykstra  Phillies   $6.2
## 12      Barry Larkin      Reds   $5.9
## 13      Jose Canseco   Red Sox   $5.8
## 14   Bret Saberhagen      Mets   $5.6
## 15    Gary Sheffield   Marlins   $5.6
## 16        Will Clark   Rangers   $5.6
## 17     Jack McDowell   Yankees   $5.4
## 18 Darryl Strawberry   Dodgers   $5.3
## 19     Mark Langston    Angels     $5
## 20      Larry Walker   Rockies     $5
## 21       Greg Vaughn   Brewers   $4.9
## 22      Wally Joyner    Royals   $4.8
## 23        Tony Gwynn    Padres   $4.7
## 24   Dennis Martinez   Indians   $4.6
## 25          Ken Hill Cardinals   $4.5
## 26          Jay Bell   Pirates   $4.4
## 27        Mark Grace      Cubs   $4.4
## 28       Moises Alou     Expos     $3

The max salaries for each team went to future Hall of Famers Ken Griffey Jr., Barry Larkin, Kirby Puckett, and Cal Ripken, Jr., as well as other popular players of the time, including Barry Bonds, Jose Canseco, Mark McGwire, and Daryl Strawberry. So, it seems that salaries went to popular players, who were presumably also the players who played well.

Here’s a word cloud, with sizes/color corresponding to salary:

Moving on.

Money vs. Performance

[Note that I left in a lot of data exploration in this section, so it will be a bit messier than above.]

Do better players make more? Let’s first look at how WAR may have influenced changes in salary.

#Previously calculated salary changes and performance changes year-to-year and saved in warplus.csv
warstats <- wars %>% filter(pitcher=="N")
#warstats <- wars
warstats <- warstats[c("playerID","yearID","teamID",
                       "WAA","WAA_off","WAA_def",
                       "WAR","WAR_off","WAR_def",
                       "prevWAA","prevWAA_off","prevWAA_def",
                       "prevWAR","prevWAR_off","prevWAR_def",
                       "dWAA","dWAA_off","dWAA_def",
                       "dWAR","dWAR_off","dWAR_def",
                       "OPS_plus","salary","bump","cut","dsal","pdsal")]
warstats <- warstats %>% filter(yearID > 1985)

warstats <- warstats %>% filter(dsal!=0)
warstats <- warstats %>% filter(abs(dsal)>1)

warcut <- warstats %>% filter(dsal<0)
warcut %>% ggplot() + geom_histogram(aes(as.numeric(prevWAR)),binwidth=0.05) +
  labs(title="Players who received a salary decrease",x="Wins Above Replacement")

warbump <- warstats %>% filter(dsal>0)
warbump %>% ggplot() + geom_histogram(aes(as.numeric(prevWAR)),binwidth=0.05) +
  labs(title="Players who received a salary increase",x="Wins Above Replacement")

All of the performance stats (including Wins Above Average (WAA), offensive and defensive stats, and previous and current performance/salary) listed in the above chunk of code were tested. No obvious trend was there. Let’s look at other trends.

warstats$pdsal[which(warstats$pdsal==Inf)] = NA

warstats %>% filter(pdsal < 40, dsal > -100000) %>% ggplot() + 
  geom_point(aes(prevWAA_off,dsal + abs(min(dsal))+1 )) +
  geom_smooth(aes(prevWAA_off,dsal + abs(min(dsal))+1 )) +
  scale_y_continuous(trans="log10") +
  labs(x="WAA in previous year", y="Salary change ($, log millions)")

warstats %>% ggplot() + 
  geom_point(aes(prevWAA_off,dsal/1000000)) +
  geom_smooth(aes(prevWAA_off,dsal/1000000)) +
  scale_y_continuous(breaks=seq(-30,30,5)) +
  labs(x="WAA in previous year", y="Salary change ($, millions)")

warstats %>% ggplot() + 
  geom_point(aes(prevWAR_off,dsal/1000000)) +
  geom_smooth(aes(prevWAR_off,dsal/1000000)) +
  scale_y_continuous(breaks=seq(-30,30,5)) +
  labs(x="WAR in previous year", y="Salary change ($, millions)")

warstats %>% ggplot() + 
  geom_point(aes(prevWAR,salary/1000000)) +
  geom_smooth(aes(prevWAR,salary/1000000)) +
  scale_y_continuous(breaks=seq(-30,30,5)) +
  labs(x="WAR in previous year", y="Current salary ($, millions)")

There may be an increasing trend. Let’s look at a regression.

summary(lm(warstats$salary~warstats$prevWAR))
## 
## Call:
## lm(formula = warstats$salary ~ warstats$prevWAR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8716085 -1365840  -762866   220743 27749911 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1133504      34353    33.0   <2e-16 ***
## warstats$prevWAR   752798      14847    50.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3103000 on 12132 degrees of freedom
## Multiple R-squared:  0.1749, Adjusted R-squared:  0.1748 
## F-statistic:  2571 on 1 and 12132 DF,  p-value: < 2.2e-16

There is a significant effect, but the R-squared is not convincing. What about for a single year?

warstats %>% filter(yearID==1991) %>% ggplot() + 
  geom_point(aes(prevWAR,salary/1000000)) +
  geom_smooth(aes(prevWAR,salary/1000000)) +
  scale_y_continuous(breaks=seq(-30,30,5),trans="log10") +
  labs(x="WAR in previous year", y="Salary change ($, millions)")

w2 <- warstats %>% filter(yearID==1991)
summary(lm(w2$dsal~w2$prevWAR))
## 
## Call:
## lm(formula = w2$dsal ~ w2$prevWAR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1318664  -198423   -80652    88553  2004829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   161395      26878   6.005 4.81e-09 ***
## w2$prevWAR    135519      11234  12.063  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407100 on 349 degrees of freedom
## Multiple R-squared:  0.2943, Adjusted R-squared:  0.2922 
## F-statistic: 145.5 on 1 and 349 DF,  p-value: < 2.2e-16

Sure enough, this looks a bit more convincing. Let’s look at this over several years.

##                Df  Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(cba)  4 0.22894 0.05724    36.4 4.39e-10 ***
## Residuals      25 0.03932 0.00157                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = adj.r.squared ~ as.factor(cba), data = regres)
## 
## $`as.factor(cba)`
##              diff          lwr          upr     p adj
## 8-7    0.06186162  0.003628854  0.120094394 0.0333907
## 9-7   -0.01380545 -0.085125738  0.057514835 0.9784690
## 10-7  -0.13789710 -0.209217387 -0.066576814 0.0000596
## 11-7  -0.16410402 -0.227002599 -0.101205438 0.0000005
## 9-8   -0.07566708 -0.146987362 -0.004346789 0.0336931
## 10-8  -0.19975872 -0.271079011 -0.128438438 0.0000001
## 11-8  -0.22596564 -0.288864223 -0.163067062 0.0000000
## 10-9  -0.12409165 -0.206445222 -0.041738075 0.0014244
## 11-9  -0.15029857 -0.225476750 -0.075120384 0.0000367
## 11-10 -0.02620692 -0.101385102  0.048971265 0.8420085

We see that the correlation between current salary and WAR in the previous year changes over time, with greater correlation during CBA6-CBA9, and less thereafter.

##                Df  Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(cba)  4 0.14687 0.03672   33.12 1.18e-09 ***
## Residuals      25 0.02772 0.00111                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = adj.r.squared ~ as.factor(cba), data = regres)
## 
## $`as.factor(cba)`
##                diff         lwr         upr     p adj
## 8-7    0.1071798491  0.05828624  0.15607346 0.0000090
## 9-7    0.0114611309 -0.04842107  0.07134333 0.9793432
## 10-7  -0.0748273792 -0.13470958 -0.01494518 0.0092462
## 11-7  -0.0740462640 -0.12685740 -0.02123513 0.0030796
## 9-8   -0.0957187182 -0.15560092 -0.03583652 0.0007211
## 10-8  -0.1820072283 -0.24188943 -0.12212503 0.0000000
## 11-8  -0.1812261131 -0.23403725 -0.12841497 0.0000000
## 10-9  -0.0862885101 -0.15543452 -0.01714250 0.0093555
## 11-9  -0.0855073949 -0.14862878 -0.02238601 0.0043512
## 11-10  0.0007811152 -0.06234027  0.06390250 0.9999996

Looking at salary and performance within the same year, we see that the highest correlation was during CBA8, though there were significant differences between periods.

Why would this correlation change with the collective bargaining agreements? Perhaps it had to do with luxury tax. Or perhaps teams simply got better at paying players less. Perhaps WAR doesn’t tell the whole story.

Now let’s look at how salary may have affected team performances, in successive five-year periods. First, win percentage against salary, with salary standardized by year.

#yearly standard
require(gridExtra)
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)
plots = list()

for(i in 1:6){
  tmpsal <- salteam3 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
  tmpsal <- tmpsal %>% mutate(stdsal = (totsal - mutoty)/sdtoty)
  wlres <- tmpsal %>% lm(wpct ~ stdsal,data=.)
  print(summary(wlres))
  label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
  myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wpct)) + geom_smooth(aes(stdsal,wpct)) + labs(title=label,x="Standardized Salary",y="Win Percentage")
  newvar = paste("myplot",i,sep="")
  assign(newvar,myplot)
  #print(myplot)
}
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.170968 -0.041127  0.003113  0.043322  0.154448 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499950   0.006221  80.363   <2e-16 ***
## stdsal      0.011004   0.006344   1.734   0.0859 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06344 on 102 degrees of freedom
## Multiple R-squared:  0.02865,    Adjusted R-squared:  0.01913 
## F-statistic: 3.008 on 1 and 102 DF,  p-value: 0.08585
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14919 -0.05129 -0.01026  0.05134  0.17869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.500060   0.006151  81.299  < 2e-16 ***
## stdsal      0.017996   0.006268   2.871  0.00494 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06392 on 106 degrees of freedom
## Multiple R-squared:  0.07215,    Adjusted R-squared:  0.0634 
## F-statistic: 8.243 on 1 and 106 DF,  p-value: 0.00494
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16329 -0.04518  0.00182  0.03722  0.14106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499952   0.005431   92.05  < 2e-16 ***
## stdsal      0.039849   0.005527    7.21 6.57e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0585 on 114 degrees of freedom
## Multiple R-squared:  0.3132, Adjusted R-squared:  0.3071 
## F-statistic: 51.98 on 1 and 114 DF,  p-value: 6.57e-11
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20606 -0.05451  0.01146  0.05256  0.20215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.500000   0.006899  72.477  < 2e-16 ***
## stdsal      0.036670   0.007017   5.226 7.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07557 on 118 degrees of freedom
## Multiple R-squared:  0.188,  Adjusted R-squared:  0.1811 
## F-statistic: 27.31 on 1 and 118 DF,  p-value: 7.566e-07
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.145329 -0.041975  0.002131  0.041144  0.134287 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499977   0.005218  95.812  < 2e-16 ***
## stdsal      0.029380   0.005308   5.536  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05716 on 118 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.1994 
## F-statistic: 30.64 on 1 and 118 DF,  p-value: 1.904e-07
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.149096 -0.038805  0.007312  0.052624  0.114660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499994   0.006038  82.801  < 2e-16 ***
## stdsal      0.021162   0.006142   3.446  0.00079 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06615 on 118 degrees of freedom
## Multiple R-squared:  0.09141,    Adjusted R-squared:  0.08371 
## F-statistic: 11.87 on 1 and 118 DF,  p-value: 0.0007899
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)

Next, win percentage against salary, with salary standardized over each 5-year period.

#5 year standard
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)

for(i in 1:6){
  tmpsal <- salteam2 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
  meansdsal <- tmpsal %>% summarize(meansal = mean(totsal), sdsal = sd(totsal))
  tmpsal <- tmpsal %>% mutate(stdsal = (totsal - meansdsal$meansal)/meansdsal$sdsal)
  wlres <- tmpsal %>% lm(wpct ~ stdsal,data=.)
  print(summary(wlres))
  label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
  myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wpct)) + geom_smooth(aes(stdsal,wpct)) + labs(title=label,x="Standardized Salary",y="Win Percentage")
  newvar = paste("myplot",i,sep="")
  assign(newvar,myplot)
  #print(myplot)
}
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.172902 -0.040316  0.003635  0.042368  0.154907 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499829   0.006214  80.440   <2e-16 ***
## stdsal      0.010701   0.005914   1.809   0.0734 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06336 on 102 degrees of freedom
## Multiple R-squared:  0.03109,    Adjusted R-squared:  0.0216 
## F-statistic: 3.273 on 1 and 102 DF,  p-value: 0.07335
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.148902 -0.047993  0.000505  0.046672  0.163294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.498574   0.005951  83.773  < 2e-16 ***
## stdsal      0.019611   0.004829   4.061 9.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06173 on 106 degrees of freedom
## Multiple R-squared:  0.1346, Adjusted R-squared:  0.1265 
## F-statistic: 16.49 on 1 and 106 DF,  p-value: 9.39e-05
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15143 -0.04235  0.00053  0.04179  0.16215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.496562   0.005639  88.054  < 2e-16 ***
## stdsal      0.028600   0.004450   6.427 3.13e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06047 on 114 degrees of freedom
## Multiple R-squared:  0.266,  Adjusted R-squared:  0.2595 
## F-statistic: 41.31 on 1 and 114 DF,  p-value: 3.129e-09
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20521 -0.05467  0.01017  0.05316  0.20306 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499105   0.006890   72.44  < 2e-16 ***
## stdsal      0.036616   0.006948    5.27 6.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07545 on 118 degrees of freedom
## Multiple R-squared:  0.1905, Adjusted R-squared:  0.1837 
## F-statistic: 27.77 on 1 and 118 DF,  p-value: 6.242e-07
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146966 -0.040119  0.005833  0.039018  0.133624 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499374   0.005238  95.334  < 2e-16 ***
## stdsal      0.028332   0.005209   5.439 2.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05737 on 118 degrees of freedom
## Multiple R-squared:  0.2005, Adjusted R-squared:  0.1937 
## F-statistic: 29.59 on 1 and 118 DF,  p-value: 2.938e-07
## 
## 
## Call:
## lm(formula = wpct ~ stdsal, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.153625 -0.041106  0.005691  0.053141  0.114145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.499030   0.006087  81.988  < 2e-16 ***
## stdsal      0.017940   0.005633   3.185  0.00185 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06659 on 118 degrees of freedom
## Multiple R-squared:  0.07916,    Adjusted R-squared:  0.07136 
## F-statistic: 10.14 on 1 and 118 DF,  p-value: 0.001852
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)

Next, league rank against salary, with salary standardized over each 5-year period.

date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)

for(i in 1:6){
  tmpsal <- salteam2 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
  meansdsal <- tmpsal %>% summarize(meansal = mean(totsal), sdsal = sd(totsal))
  tmpsal <- tmpsal %>% mutate(stdsal = (totsal - meansdsal$meansal)/meansdsal$sdsal)
  wlres <- tmpsal %>% lm(Rank ~ stdsal,data=.)
  print(summary(wlres))
  label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
  myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,Rank)) + geom_smooth(aes(stdsal,Rank)) + labs(title=label,x="Standardized Salary",y="Rank")
  newvar = paste("myplot",i,sep="")
  assign(newvar,myplot)
  #print(myplot)
}
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2970 -1.6387 -0.0339  1.6072  3.6072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7345     0.1822  20.500   <2e-16 ***
## stdsal       -0.3294     0.1734  -1.899   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.858 on 102 degrees of freedom
## Multiple R-squared:  0.03416,    Adjusted R-squared:  0.02469 
## F-statistic: 3.608 on 1 and 102 DF,  p-value: 0.06033
## 
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1544 -1.4275 -0.1564  1.2422  4.0194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5734     0.1689  21.161  < 2e-16 ***
## stdsal       -0.4798     0.1370  -3.502 0.000678 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 106 degrees of freedom
## Multiple R-squared:  0.1037, Adjusted R-squared:  0.09523 
## F-statistic: 12.26 on 1 and 106 DF,  p-value: 0.0006778
## 
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4576 -0.9606 -0.0107  0.9254  3.6188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.00601    0.12036  24.976  < 2e-16 ***
## stdsal      -0.48703    0.09497  -5.128 1.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.291 on 114 degrees of freedom
## Multiple R-squared:  0.1874, Adjusted R-squared:  0.1803 
## F-statistic:  26.3 on 1 and 114 DF,  p-value: 1.21e-06
## 
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7160 -1.1112  0.1837  0.9506  2.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0492     0.1218  25.027  < 2e-16 ***
## stdsal       -0.6499     0.1229  -5.289 5.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.334 on 118 degrees of freedom
## Multiple R-squared:  0.1917, Adjusted R-squared:  0.1848 
## F-statistic: 27.98 on 1 and 118 DF,  p-value: 5.729e-07
## 
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68509 -1.13595  0.01056  1.02661  3.16204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0282     0.1241  24.394  < 2e-16 ***
## stdsal       -0.5436     0.1234  -4.404 2.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.36 on 118 degrees of freedom
## Multiple R-squared:  0.1412, Adjusted R-squared:  0.1339 
## F-statistic: 19.39 on 1 and 118 DF,  p-value: 2.351e-05
## 
## 
## Call:
## lm(formula = Rank ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4240 -1.1755 -0.1448  0.9357  2.6941 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0421     0.1295  23.496  < 2e-16 ***
## stdsal       -0.3180     0.1198  -2.654  0.00905 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.417 on 118 degrees of freedom
## Multiple R-squared:  0.05633,    Adjusted R-squared:  0.04834 
## F-statistic: 7.044 on 1 and 118 DF,  p-value: 0.009048
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)

Next, win-loss ratio against salary, with salary standardized over each 5-year period.

#yearly standard
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)

for(i in 1:6){
  tmpsal <- salteam3 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
  tmpsal <- tmpsal %>% mutate(stdsal = (totsal - mutoty)/sdtoty)
  wlres <- tmpsal %>% lm(wlratio ~ stdsal,data=.)
  print(summary(wlres))
  label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
  myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wlratio)) + geom_smooth(aes(stdsal,wlratio)) + labs(title=label,x="Standardized Salary",y="Win-Loss Ratio")
  newvar = paste("myplot",i,sep="")
  assign(newvar,myplot)
  #print(myplot)
}
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55857 -0.17972 -0.01983  0.15692  0.90961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03342    0.02611  39.586   <2e-16 ***
## stdsal       0.05109    0.02662   1.919   0.0578 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2662 on 102 degrees of freedom
## Multiple R-squared:  0.03485,    Adjusted R-squared:  0.02539 
## F-statistic: 3.683 on 1 and 102 DF,  p-value: 0.05777
## 
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51801 -0.21742 -0.07618  0.19493  0.93268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03698    0.02658  39.011  < 2e-16 ***
## stdsal       0.07270    0.02709   2.684  0.00845 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2762 on 106 degrees of freedom
## Multiple R-squared:  0.06362,    Adjusted R-squared:  0.05479 
## F-statistic: 7.202 on 1 and 106 DF,  p-value: 0.00845
## 
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52790 -0.17900 -0.02483  0.14699  1.05555 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.04262    0.02438  42.758  < 2e-16 ***
## stdsal       0.17596    0.02482   7.091 1.19e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2626 on 114 degrees of freedom
## Multiple R-squared:  0.3061, Adjusted R-squared:    0.3 
## F-statistic: 50.28 on 1 and 114 DF,  p-value: 1.193e-10
## 
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62279 -0.21269 -0.00028  0.18769  1.40373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.05802    0.02965  35.684  < 2e-16 ***
## stdsal       0.15825    0.03016   5.248 6.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3248 on 118 degrees of freedom
## Multiple R-squared:  0.1892, Adjusted R-squared:  0.1823 
## F-statistic: 27.54 on 1 and 118 DF,  p-value: 6.883e-07
## 
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52235 -0.15494 -0.01333  0.16523  0.61128 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03274    0.02112  48.897  < 2e-16 ***
## stdsal       0.12556    0.02148   5.845  4.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2314 on 118 degrees of freedom
## Multiple R-squared:  0.2245, Adjusted R-squared:  0.2179 
## F-statistic: 34.16 on 1 and 118 DF,  p-value: 4.596e-08
## 
## 
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4709 -0.1866 -0.0040  0.2036  0.5325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03796    0.02447  42.412  < 2e-16 ***
## stdsal       0.08536    0.02489   3.429 0.000834 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2681 on 118 degrees of freedom
## Multiple R-squared:  0.09063,    Adjusted R-squared:  0.08292 
## F-statistic: 11.76 on 1 and 118 DF,  p-value: 0.0008343
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)

The above analyses seem to indicate better win rates and ranks (with 1 being best) as more money is spent. The correlations vary by period, however, so let’s do another year-by-year breakdown.

Average salary seems to have a greater effect than total salary, though both have similar trends. The greatest correlation is during CBA8. Let’s see about the effect.

Even at its greatest, the effect is relatively small, with about a 4 percent increase in win percentage for a 1 standard deviation increase in average salary in a given year. How about league rank?

The effect size is similarly underwhelming, with a maximum of a ~0.9 increase in league rank for every standard deviation increase in average salary.

The correlation between average salary and spending is generally weak, with most adjusted R-squared values below 0.1.

Overall, the effect of spending more money seems to be small, but perhaps those small margins, combined with other management factors, are enough to translate into a competitive advantage, however transient.

Conclusion

In summary, we have seen that:
(1) Player salaries have increased over time, but the top rates have done so faster than the lower ones.
(2) Financial measures taken following the 1994 strike and in successive CBAs seem to have stabilized revenue streams for individual franchises and the overall growth of the league.
(3) Teams that were previously dominant on paper are less so today, but Major League Baseball remains as competitive as ever.
(4) Paying more money has a small effect on player and team performance, which has decreased over time.